Crate tiktoken_rs


`tiktoken-rs`

Rust library for tokenizing text with OpenAI models using tiktoken.

This library provides a set of ready-made tokenizer libraries for working with GPT, tiktoken, and related OpenAI models. Use cases cover tokenizing and counting tokens in text inputs.

This library is built on top of the tiktoken library and includes some additional features and enhancements for ease of use with Rust code.

Examples

For full working examples for all supported features, see the examples directory in the repository.

Usage

Add this library to your project with cargo:

cargo add tiktoken-rs

Then in your Rust code, call the API:

Counting token length

use tiktoken_rs::p50k_base;

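// Load the p50k_base encoding.
let bpe = p50k_base().unwrap();
// Encode the text, allowing special tokens, and count the resulting tokens.
let tokens = bpe.encode_with_special_tokens(
  "This is a sentence   with spaces"
);
println!("Token count: {}", tokens.len());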

Counting max_tokens for a chat completion request

use tiktoken_rs::get_chat_completion_max_tokens;
use async_openai::types::{ChatCompletionRequestMessageArgs, Role};

let messages = vec![
    ChatCompletionRequestMessageArgs::default()
        .content("You are a helpful assistant!")
        .role(Role::System)
        .build()
        .unwrap(),
    ChatCompletionRequestMessageArgs::default()
        .content("Hello, how are you?")
        .role(Role::User)
        .build()
        .unwrap(),
];
// Tokens still available for the completion after the messages are counted.
let max_tokens = get_chat_completion_max_tokens("gpt-4", &messages).unwrap();
println!("max_tokens: {}", max_tokens);
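
One way to think about the value reported above is as the model's context window minus the tokens already consumed by the messages. The sketch below shows only that arithmetic. `remaining_tokens` is a hypothetical helper, not part of this crate, and the crate's own accounting (for example, any per-message overhead) may differ.

```rust
/// Tokens left for the model's reply once the prompt has been counted.
/// Hypothetical helper for illustration; both values are supplied by the caller.
fn remaining_tokens(context_size: usize, prompt_tokens: usize) -> usize {
    // saturating_sub returns 0 instead of underflowing when the prompt
    // is already larger than the context window.
    context_size.saturating_sub(prompt_tokens)
}
```

For instance, with an assumed context window of 8192 tokens and a 19-token prompt, `remaining_tokens(8192, 19)` evaluates to 8173. A prompt longer than the window yields 0.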

tiktoken supports these encodings used by OpenAI models:

| Encoding name | OpenAI models |
| --- | --- |
| `cl100k_base` | ChatGPT models, `text-embedding-ada-002` |
| `p50k_base` | Code models, `text-davinci-002`, `text-davinci-003` |
| `p50k_edit` | Edit models like `text-davinci-edit-001`, `code-davinci-edit-001` |
| `r50k_base` (or `gpt2`) | GPT-3 models like `davinci` |

See the examples in the repo for use cases. For more context on the different tokenizers, see the OpenAI Cookbook.
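
The table above amounts to a lookup from model name to encoding name, which is the kind of mapping `get_bpe_from_model` resolves for you. A minimal sketch of such a lookup follows. `encoding_for_model` is a hypothetical helper, not part of this crate, and the `gpt-3.5-turbo` and `gpt-4` prefixes are an assumption about which names count as "ChatGPT models".

```rust
/// Returns the encoding name for a model, following the table above.
/// Hypothetical helper for illustration; not part of tiktoken-rs.
fn encoding_for_model(model: &str) -> Option<&'static str> {
    match model {
        "text-embedding-ada-002" => Some("cl100k_base"),
        "text-davinci-002" | "text-davinci-003" => Some("p50k_base"),
        "text-davinci-edit-001" | "code-davinci-edit-001" => Some("p50k_edit"),
        "davinci" => Some("r50k_base"),
        // Assumption: ChatGPT models are recognized by these name prefixes.
        m if m.starts_with("gpt-3.5-turbo") || m.starts_with("gpt-4") => Some("cl100k_base"),
        // Unknown models are reported as None rather than guessed.
        _ => None,
    }
}
```

Returning `Option` keeps unknown model names explicit, so the caller decides whether to fall back to a default encoding or report an error.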

Encountered any bugs?

If you encounter any bugs or have any suggestions for improvements, please open an issue on the repository.

Acknowledgements

Thanks to @spolu for the original code and the .tiktoken files.

License

This project is licensed under the MIT License.

Modules

model
contains information about OpenAI models.
tokenizer
lists out the available tokenizers for different OpenAI models.

Structs

Constants

Functions

byte_pair_encode
byte_pair_split
cl100k_base
Use for ChatGPT models, text-embedding-ada-002. Initializes and returns a new instance of the cl100k_base tokenizer.
cl100k_base_singleton
Returns a singleton instance of the cl100k_base tokenizer. Use for ChatGPT models, text-embedding-ada-002.
get_bpe_from_model
Returns a CoreBPE instance corresponding to the tokenizer used by the given model.
get_bpe_from_tokenizer
Returns a CoreBPE instance corresponding to the given tokenizer.
get_chat_completion_max_tokens
Calculates the maximum number of tokens available for chat completion based on the model and messages provided.
get_completion_max_tokens
Calculates the maximum number of tokens available for completion based on the model and prompt provided.
p50k_base
Use for Code models, text-davinci-002, text-davinci-003. Initializes and returns a new instance of the p50k_base tokenizer.
p50k_base_singleton
Returns a singleton instance of the p50k_base tokenizer. Use for Code models, text-davinci-002, text-davinci-003.
p50k_edit
Use for edit models like text-davinci-edit-001, code-davinci-edit-001. Initializes and returns a new instance of the p50k_edit tokenizer.
p50k_edit_singleton
Returns a singleton instance of the p50k_edit tokenizer. Use for edit models like text-davinci-edit-001, code-davinci-edit-001.
r50k_base
Use for GPT-3 models like davinci. Initializes and returns a new instance of the r50k_base tokenizer (also known as gpt2).
r50k_base_singleton
Returns a singleton instance of the r50k_base tokenizer (also known as gpt2). Use for GPT-3 models like davinci.
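
The functions above come in pairs: one builds a new tokenizer on every call, the other returns a shared singleton. The difference can be sketched with the standard library alone. `ToyTokenizer`, `toy_base`, and `toy_base_singleton` are hypothetical names for illustration. The actual return types of this crate's `*_singleton` functions are not shown on this page and may differ, for example by wrapping the instance in a lock or a reference-counted pointer.

```rust
use std::sync::OnceLock;

/// Stand-in for a tokenizer that is expensive to construct (hypothetical type).
struct ToyTokenizer {
    name: &'static str,
}

/// Builds a fresh instance on every call, in the spirit of the `*_base()` functions.
fn toy_base() -> ToyTokenizer {
    ToyTokenizer { name: "toy_base" }
}

/// Builds the instance on first use and returns the same one afterwards,
/// in the spirit of the `*_base_singleton()` functions.
fn toy_base_singleton() -> &'static ToyTokenizer {
    static INSTANCE: OnceLock<ToyTokenizer> = OnceLock::new();
    // get_or_init runs toy_base at most once, even across threads.
    INSTANCE.get_or_init(toy_base)
}
```

Every call to `toy_base_singleton()` returns a reference to the same value, so the construction cost is paid once per process. A singleton is the better fit when tokenizing repeatedly, while a fresh instance suits one-off use.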